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Abstract 

This report concerns the problem of dimensionality reduction through information geometric meth- 
ods on statistical manifolds. While there has been considerable work recently presented regarding dimen- 
sionality reduction for the purposes of learning tasks such as classification, clustering, and visualization, 
these methods have focused primarily on Riemannian manifolds in Euclidean space. While sufficient 
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\^ ' for many applications, there are many high-dimensional signals which have no straightforward and 
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meaningful Euclidean representation. In these cases, signals may be more appropriately represented as 
a realization of some distribution lying on a statistical manifold, or a manifold of probability density 



■ functions (PDFs). We present a framework for dimensionality reduction that uses information geometry 



for both statistical manifold reconstruction as well as dimensionality reduction in the data domain. 

I. Introduction 

In the recent past, sensing and media storage capabilities have enabled the generation of 
enormous amounts of information, often in the form of high-dimensional data. This is easily 
viewed within sensor networks, imaging, and biomedical applications such as flow cytometry and 
gene micro-arrays. While this vast amount of retrieved data has opened a wealth of opportunities 
for data analysis, the problem of the curse of dimensionality has become more substantial. 
The high dimensional nature of data is often simply a product of its representation. In many 
instances data dimensions are redundant and entirely correlated with some combination of other 
dimensions within the same data set. In these instances, although the retrieved data seems to 
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exhibit a naturally high dimension, it is actually constrained to a lower dimensional subset - 
manifold - of the measurement space. This allows for significant dimension reduction with minor 
or no loss of information. 

Often data does not exhibit a low intrinsic dimension in the data domain as one would have in 
manifold learning in Euclidean space. A straightforward strategy is to express the data in terms 
of a low-dimensional feature vector to alleviate the dimensionality issue. This initial processing 
of data as real- valued feature vectors in Euclidean space, which is often carried out in an ad hoc 
manner, has been called the "dirty laundry" of machine learning |l9l. This procedure is highly 
dependent on having a good model for the data, and in the absence of such model may be highly 
suboptimal. 

In this report we discuss an information geometric framework for dimensionality reduction. 
We view high-dimensional data sets as realizations of some generative model, or probability 
density function (PDF). Rather than dealing with Riemannian manifolds in a Euclidean space, 
we focus our attention on statistical manifolds for our geometric constructs. We offer two forms 
of dimensionality reduction: one which embeds these high-dimensional data sets into a single 
low-dimensional representation in Euclidean space by reconstructing the statistical manifold. 
This is performed by multidimensional scaling using information geometric measures of distance 
between PDFs. Secondly, we offer dimensionality reduction in the data domain by preserving 
the high-dimensionality similarities between data PDFs in the low-dimensional subspace. This 
is useful for both visualization and variable selection of high-dimensional data. 

II. Background on Information Geometry 

Information geometry is a field that has emerged from the study of geometrical constructs on 
manifolds of probability distributions. These investigations analyze probability distributions as 
geometrical structures in a Riemannian space. Using tools and methods deriving from differential 
geometry, information geometry is applicable to information theory, probability theory, and 
statistics. The field of information theory is largely based on the works of Shun'ichi Amari 
and has been used for analysis in such fields as statistical inference, neural networks, and control 
systems. In this section, we will give a brief background on the methods of information geometry 
utilized throughout the rest of this report. For a more thorough introduction to information 
geometry, we suggest lfT6l . ||2l. 
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A. Differential Manifolds 

The concept of a differential manifold is similar to that of a smooth curve or surface lying 
in a high-dimensional space. A manifold 7V1 can be intuitively thought of as a set of points 
with a coordinate system. These points can be from a variety of constructs, such as Euclidean 
coordinates, linear system, images, or probability distributions. Regardless of the definition of 
the points in the manifold Ai, there exists a coordinate system with a one-to-one mapping from 
Ai to W^, and as such, d is known as the dimension of Ai. 

For reference, we will refer to the coordinate system on as : ^ W^. If ^ has AI as 
its domain, we call it a global coordinate system [2J. In this situation, ^ is a one-to-one mapping 
onto for all points in A manifold is differentiable if the coordinate system mapping ip 
is differentiable over its entire domain. If ^ is infinitely differentiable, the manifold is said to 
be 'smooth' [16]. 

In many cases there does not exist a global coordinate system. Examples of such manifolds 
include the surface of a sphere, the "swiss roll", and the torus. For these manifolds, there are only 
local coordinate systems. Intuitively, a local coordinate system acts as a global coordinate system 
for a local neighborhood of the manifold, and there may be many local coordinate systems for 
a particular manifold. Fortunately, since a local coordinate system contains the same properties 
as a global coordinate system (only on a local level), analysis is consistent between the two. As 
such, we shall focus solely on manifolds with a global coordinate system. 

1) Statistical Manifolds: Let us now present the notion of statistical manifolds, or a set Ai 
whose elements are probability distributions. A probability density function (PDF) on a set X 
is defined as a function p : X ^ in which 



We describe only the case for continuum on the set X, however if X was discrete valued, 
equation ([U) will still apply by switching J p{x) dx = 1 with J2pi^) = 1- If consider A4 
to be a family of PDFs on the set X, in which each element of is a PDF which can be 
parameterized by 6' = [^^^, . . . , ^"], then Ai is known as a statistical model on X. Specifically, 



p{x) >0,WxeX 




(1) 
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let 

M = {p{x \ e)\eeec m^^}, (2) 

with p{x I 9) satisfying the equations in ([T]). Additionally, there exists a one-to-one mapping 
between 6 and p{x \ 6). 

Given certain properties of the parameterization of Ai, such as differentiability and C°° 
diffeomorphism (details of which are described in [2J), the parameterization 9 is also a coordinate 
system of A^. In this case, is known as a statistical manifold. In the rest of this report, we 
will use the terms 'manifold' and 'statistical manifold' interchangeably. 

B. Fisher Information Distance 

The Fisher information metric measures the amount of information a random variable X 
contains in reference to an unknown parameter 9. For the single parameter case it is defined as 

d 



I{9) = E 



89 



log/(X; 



\9 



If the condition J ^/(X; 9) dX = is met, then the above equation can be written as 



I{9) = -E 



d9 



For the case of multiple parameters 9 = [9^ , . . . , 9"-], we define the Fisher information matrix 
[T{9)], whose elements consist of the Fisher information with respect to specified parameters, 
as 

For a parametric family of probability distributions, it is possible to define a Riemannian 
metric using the Fisher information matrix, known as the information metric. The information 
metric distance, or Fisher information distance, between two distributions p{x; 9i) and p{x] 92) 
in a single parameter family is 

Df{9^,92) = ri{9f'H9, (4) 

where 9i and 6^2 are parameter values corresponding to the two PDFs and T{9) is the Fisher 
information for the parameter 9. Extending to the multi-parameter case, we obtain: 

e(o)=ei 
e{i)=e2 



(5) 
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where 6 = 9(t) is the parameter path along the manifold. Note that the coordinate system of 
a statistical manifold is the same as the parameterization of the PDFs (i.e. 9). Essentially, ^ 
amounts to finding the length of the shortest path - the geodesic - on connecting coordinates 
01 and 6*2. 

C. Approximation of Fisher Information Distance 

The Fisher information distance is a consistent metric, regardless of the parameterization of the 
manifold [l25l . This fact enables the approximation of the information distance when the specific 
parameterization of the manifold is unknown, and there have been many metrics developed for 
this approximation. An important class of such divergences is known as the /-divergence fS], 
in which f{u) is a convex function on n > and 



DjipU) = I Pi^)f (^) dx. 



A specific and important example of the /-divergence is the a-divergence, where D^"^ = Dj-^a) 
for a real number a. The function f^"\u) is defined as 

^ (1 - a^±l 
u log u a = 1 ■ 

— logM a = —1 



As such, the ct-divergence can be evaluated as 



D^'^^ipk) = (l - / Pix^-^qixy-^dx^ a ^ ±1, 

and 

D^-'\p\\q) = D^'\q\\p) = [ p(x)\og^^dx. (6) 

The a-divergence is the basis for many important and well known divergence metrics, such as 
the KuUback-Leibler divergence and the Hellinger distance. 

1 ) Kullback-Leibler Divergence: The KuUback-Leibler (KL) divergence is defined as 

/p{x^ 
Pix) log -^^dx, (7) 

which is equal to D^^^^ The KL-divergence is a very important metric in information theory, 
and is commonly referred to as the relative entropy of one PDF to another. Kass and Vos show 
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[fT6l the relation between the KuUback-Leibler divergence and the Fisher information distance 
is 

^/2KL{p\\q)^DF{p,q) 

as p ^ q. This allows for an approximation of the Fisher information distance, through the use 
of the available PDFs, without the need for the specific parameterization of the manifold. 

It should be noted that the KL-divergence is not a distance metric, as it does not satisfy the 
symmetry, KL{p\\q) ^ KL(p\\q), or triangle inequality properties of a distance metric. To obtain 
symmetry, we will define the symmetric KL-divergence as: 

Dkl{p, q) = KL{p\\q) + KL{q\\p), (8) 

which is symmetric, but still not a distance as it does not satisfy the triangle inequality. Since 
the Fisher information is a symmetric measure, we can relate the symmetric KL-divergence and 
approximate the Fisher information distance as 

^DKL{p,q)^DF{p,q), (9) 

as p g. 

2) Hellinger Distance: Another important result of the ^-divergence is the evaluation with 

a = 0: 

D(o)(p||g) = 2 j (^y^) - dx, 
which is called the closely related to the Hellinger distance, 

which satisfies the axioms of distance - symmetry and the triangle inequality. The Hellinger 
distance is related to the information distance in the limit by 

2DH{p,q)^DFip,q) 

as p — > g lfT6l . We note that the Hellinger distance is related to the KuUback-Leibler divergence, 
as in the limit ^yKL{p\\q) Dh{p, q). 



September 29, 2008 



DRAFT 



7 



3) Other Fisher Approximations: There are other metrics which approximate the Fisher 
information distance, such as the cosine distance. When dealing with multinomial distributions, 
the approximation 

Dcip, ,) = 2 arccos / yW)^)dx ^ D^P, 

is the natural metric on the sphere. 

Throughout this report we restrict our analysis to that of the KuUback-Leibler divergence 
and the Hellinger distance. The KL-divergence is a great means of differentiating shapes of 
continuous PDFs. Analysis of © shows that as p(x)/q{x) oo, KL{p\\q) oo. These 
properties ensure that the KL-divergence will be amplified in regions where there is a significant 
difference in the probability distributions. This cannot be used in the case of a multinomial PDF, 
however, because of divide-by-zero issues. In that case the Hellinger distance is the desired 
metric as there exists a monotonic transformation function ^ : Dh Dc fll61 . Specific details 
on how we nonparametrically calculate these information divergences is provided in Section 
IVl For additional measures of probabilistic distance, some of which approximate the Fisher 
information distance, and a means of calculating them between data sets, we refer the reader to 

m. 

III. Fisher Information Nonparametric Embedding 

Many applications of statistical manifolds have proved promising, such as document classifica- 
tion [|T9l . llTSll . BH, flow cytometry analysis [fT2|. ||5l, face recognition [O, texture segmentation 
[|20l . image analysis [|25l . clustering [[24|. and shape analysis [fTTl . While all have proposed 
alternatives to using Euclidean geometry for data modeling, most methods (outside of our own 
work) focus on clustering and classification, and do not explicitly address the problems of 
dimensionality reduction and visualization. Additionally, most presented work has been in the 
parametric setting, in which parameter estimation is a necessity for the various methods. This 
becomes ad-hoc and potentially troublesome if a good model is unavailable. 

We provide a method of dimensionality reduction - deemed Fisher Information Nonparametric 
Embedding (FINE) - which approaches the problem of statistical manifold reconstruction. This 
method includes a characterization of data sets in terms of a nonparametric statistical model, a 
geodesic approximation of the Fisher information distance as a metric for evaluating similarities 
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between data sets, and a dimensionality reduction procedure to obtain a low-dimensional Eu- 
clidean embedding of the original high-dimensional data set for the purposes of both classification 
and visualization. This non-linear embedding is driven by information, not Euclidean, geometry. 
Our methods require no explicit model assumptions; only that the given data is a realization 
from an unknown model with some natural parameterization. 

A. Approximation of Distance on Statistical Manifolds 

Let us consider the approximation function Df{pi,P2) of the Fisher information distance 
between pi and p2, which may can be calculated using a variety of metrics as pi p2 (see 
Section III-CI) . If pi and p2 do not lie closely together on the manifold, these approximations 
become weak.A good approximation can still be achieved if the manifold is densely sampled 
between the two end points. By defining the path between pi and p2 as a series of connected 
segments and summing the length of those segments, we may approximate the length of the 
geodesic with graphical methods. Specifically, given the set of n PDFs parameterized by Ve = 
{6'i, . . . , 9n}, the Fisher information distance between pi and p2 can be estimated as: 



Df{pi,P2)^ min V ^ V i 

A/,{e(i),...,e(„)} .^^ 

where p{0(i)) = Pi, p{0(^m)) = P2, {6'(i), . . . , 6(^m)} e Ve, and M < N. 

Using an approximation of the Fisher information distance as pi ^ P2, we can now define an 
approximation function G for all pairs of PDFs: 



where V = {pi, . . . ,Pn} is the available collection of PDFs on the manifold. Intuitively, this 
estimate calculates the length of the shortest path between points in a connected graph on the 
well sampled manifold, and as such G{pi,p2;V) Dp{pi,p2) as oo. This is similar to 

the manner in which Isomap [12611 approximates distances on Riemannian manifolds in Euclidean 
space. 

B. Dimensionality Reduction 

Given a matrix of dissimilarities between entities, many algorithms have been developed to 
find a low-dimensional embedding of the original data ^ : Ai ^ W^. These techniques have 
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been classified as a group of metliods called multidimensional scaling (MDS) JTl. There are 
supervised methods [|T3l . [1211 . Il22l . [fT4l which are generally used for classification purposes, 
and unsupervised methods [|23l . BUl, which are often used for clustering and manifold learning. In 
conjunction with the matrix of Fisher information distance approximations, these MDS methods 
allows us to find a single low-dimensional coordinate representation of each high-dimensional, 
large sample, data set. We do not highlight the heavily utilized Isomap [|26l algorithm since it 
is identical to using classical MDS on the approximation of the geodesic distances. 

C. FINE Algorithm 

We have identified a series of methods for manifold learning developed in the field of infor- 
mation geometry. By performing dimensionality reduction on a family of data sets, we are able 
to both better visualize and classify the data. In order to obtain a lower dimensional embedding, 
we calculate a dissimilarity metric between data sets within the family by approximating the 
Fisher information distance between their corresponding PDFs. 

In problems of practical interest, however, the parameterization of the probability densities 
is usually unknown. We instead are given a family of data sets X = {Xi, . . . , X^r}, in 
which we may assume that each data set Xj is a realization of some underlying probability 
distribution to which we do not have knowledge of the parameters. As such, we rely on 
nonparametric techniques to estimate both the probability density and the approximation of the 
Fisher information distance. In the work presented in this report, we implement kernel density 
estimation methods (see Appendix |Al), although fc-NN methods are also applicable. Following 
these approximations, we are able to perform the same multidimensional scaling operations as 
previously described. 

Fisher Information Nonparametric Embedding (FINE) is presented in Algorithm [T] and com- 
bines all of the presented methods in order to find a low-dimensional embedding of a collection 
of data sets. If we assume each data set is a realization of an underlying PDF, and each of 
those distributions lie on a manifold with some natural parameterization, then this embedding 
can be viewed as an embedding of the actual manifold into Euclidean space. Note that in line 
[H 'mds(G,dy refers to using any multidimensional scaling method to embed the dissimilarity 
matrix G into a Euclidean space with dimension d. 

At this point it is worth stressing the benefits of this framework. Through information geometry. 



September 29, 2008 



DRAFT 



10 



Algorithm 1 Fisher Information Nonparametric Embedding 

Input: Collection of data sets X = {Xi, . . . , X n}; the desired embedding dimension d 
1: for z = 1 to do 

2: Calculate Pi{x), the density estimate of Xi 
3: end for 

4: Calculate G, where G{i,j) is the geodesic approximation of the Fisher information distance 
between pi and pj (flOl) 

5: 1^ = mds(G', d) 

Output: (i-dimensional embedding of X, into Euclidean space Y E M.'^^^ 



FINE enables the joint embedding of multiple data sets Xi into a single low-dimensional 
Euclidean space. By viewing each Xi E X as a realization of pi E V, we reduce the nu- 
merous samples in Xi to a single point. The dimensionality of the statistical manifold may be 
significantly less than that of the Euclidean realizations. For example, a Gaussian distribution is 
entirely defined by its mean n and covariance S, leading to a 2-dimensional statistical manifold, 
while the dimensionality of the realization X ~ N{n, S) may be significantly larger (i.e. jj, E W^, 
d ^ 2). MDS methods reduce the dimensionality of Pi from the Euclidean dimension to the 
dimension of the statistical manifold on which it lies. This results in a single low-dimensional 
representation of each original data set Xi E X. 

IV. Information Preserving Component Analysis 

We now extend our framework to dimensionality reduction in the data domain, which is 
the standard setting for manifold learning. However, rather than focusing on the relationships 
between elements in a single data set X, it is often desirable to compare each set in a collection 
X = {Xi, . . . ,X]\f} in which X, has rii elements x E M'^. Once again, we wish to find a form 
of dimensionality reduction that preserves the information geometry of the statistical manifold 
of PDFs generating X. Unlike FINE, however, we now are interested in reduction of each Xi 
individually to a low-dimensional subspace which preserves the Fisher information distances 
between pi and pj, the estimated PDFs of Xi and Xj respectively. With an abuse of notation, 
we will further refer to Dp{pi, pj) as Dp(Xi, Xj) with the knowledge that the Fisher information 
distance is calculated with respect to PDFs, not realizations. 
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For this task, we are interested in linear and unsupervised methods, as class labels for X 
are often unavailable and linear projections will be useful for variable selection. We define the 
Information Preserving Component Analysis (IPCA) projection matrix A E M™^"', in which A 
reduces the dimension of X from d ^ m (m < d), such that 

Df{AX,,AX,) = Df{X,,X,), (11) 

Formatting as an optimization problem, we would like to solve: 

A = arg min J(A), (12) 

A:AAT=I 

where / is the identity matrix and J{A) is some cost function designed to implement (fTTI) . Note 
that we include the optimization constraint AA^ = I to ensure our projection is orthonormal, 
which keeps the data from scaling or skewing as that would undesirably distort the data. Let 
D(X) be a dissimilarity matrix such that Dij(X) = Dp{Xi, Xj), and D(X;A) is a similar 
matrix where the elements are perturbed by A, i.e. Dij(X;A) = Dp^AXi, AX j). We have 
formulated several different cost functions with differing benefits: 

MA) = \\D{X)-DiX;A)\\l (13) 

J,{A) = ||e-^W/--e-^(^;^)/^||^ (14) 

MA) = -\\D{X;A)\\l (15) 

MA) = II e-^(^^-^)/l||, (16) 

where || ■ \\f is the Frobenius norm and c is some constant. 

It should be clear that Ji{A) (fT3l) is a direct implementation of our stated objective. M^) 
(fT4l) applies a sense of locality to our objective, which is useful give that the approximations 
to the Fisher information distance are valid only as p — > g. In problems in which PDFs may 
significantly differ, this cost will prevent the algorithm be unnecessarily biased by PDFs which 
are very far away. The exponential kernel has the property of providing a larger weight to smaller 
distances. The final 2 cost functions, M^) ^31) and J4{A) (fT6l) . operate with the knowledge 
that the KuUback-Leibler divergence and Hellinger distance are strictly non-increasing given 
an orthonormal perturbation of the data. Hence, the projection matrix which best preserves the 
information divergence will be the one that maximizes the information divergence. M^) 
J4:{A) adapt Ji(v4) and M^)^ respectively, with this property - the negative sign on M^) 
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to coincide with the minimization from (fT2l) . The proofs of these properties are presented in 
Appendices |B] and O and extending them to the exact Fisher information distance is an area 
for future work. Note also that J2{A) and J4{A) are bounded in the range [0,A^^] due to the 
bounding of the negative exponential from [0, 1]. We note that a tighter bound on J2iA) and 
J^i^A) is [0, A^(A^ — 1)] since Da = 0, but as ^ oo the difference becomes negligible. 

While the choice of cost function is dependent on the problem, the overall projection method 
ensures that the similarity between data sets is maximally preserved in the desired low-dimensional 
space, allowing for comparative learning between sets. 

A. Gradient Descent 

Gradient descent (or the method of steepest descent) allows for the solution of convex opti- 
mization problems by traversing a surface or curve in the direction of greatest change, iterating 
until the minimum is reached. Specifically, let J(x) be a real-valued objective function which 
is differentiable about some point Xi. The direction in which J{x) decreases the fastest, from 
the point xi, is that of the negative gradient of J at xi, —^J{xi). By calculating the location 
of the next iteration point as 

- \ 
Xi fJjg^JyXij, 

where /i is a small number regulating the step size, we ensure that J{x.i) > J(xj_|_i). Continued 
iterations will result in J(x) converging to a local minimum. Gradient descent does not guarantee 
that the process will converge to an absolute minimum, so typically it is important to initialize 
Xo near the estimated minimum. 

Let J{A) be our objective function, measuring the error between our projected subspace and 
the full-dimensional space. The direction of the gradient is solved by taking the partial derivative 
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of J w.r.t. a projection matrix A: 

= E E 2 A) - D,,{X)) A) 

i j 

= (e-^-('^)/'= - e-^-('^^^)/'=) e-^-(^^^)/^ 

« j 

i j 

Given the direction of the gradient, the projection matrix can be updated as 

A = A-ix-^JiA), (17) 



where 



4jJ=4jJ--((4jA a^ + a(4jjY\a 

OA OA 2 WdA \dA 



is the direction of the gradient, constrained to force A to remain orthonormal (the derivation of 
this constraint can be found in Appendix ID)) . This process is iterated until the error J converges. 



B. IPCA Algorithm 

The full method for IPCA is described in Algorithm [2l We note that Ai is often initialized as 
a random orthonormal projection matrix as to not bias the estimation, but this carries the risk of 
converging to a local minimum. For certain applications it may be beneficial to initialize near 
some estimated global minimum if that information is available. At this point we stress that we 
utilize gradient descent due to its ease of implementation. There are more efficient methods of 
optimization, but that is out of the scope of the current contribution and is an area for future 
work. 
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Algorithm 2 Information Preserving Component Analysis 

Input: Collection of data sets X — {Xi, . . . , X^}, each of dimension d; the desired projection 
dimension m; search step size threshold e 
1: Calculate D{X), the Fisher information distance matrix 
2: Initialize Ai e ]R"*x<^ as some orthonormal projection matrix 
3: Calculate D(X; Ai), the Fisher information distance matrix in the projected space 
4: while I Jj — Jj_i| > e do 

5: Calculate ^J, the direction of the gradient, constrained to AA^ — I 

6: Ai^i — Ai — fJ'-^rJ 
7: Calculate D{X;Ai+i) 
8: Calculate J 

9: 2 = ^ + 1 

10: end while 

Output: IPCA Projection A e W^""^ 



C. Variable Selection 

IPCA may be used as a form of variable selection, as the loading vectors in the linear 
projection matrix A will be appropriately weighted towards the dimensions which best preserve 
the information distance between sets within the collection. For example, if two multivariate 
PDFs p and q are independent and identically distributed in a certain dimension, that dimension 
will offer zero contribution to the information distance between p and q. As such, the information 
distance is entirely defined by those areas of input space in which p and q differ. When finding 
a projection which preserves the information distance between p and q, A is going to be highly 
weighted towards the variables which contribute most to that distance. Hence, the loading vectors 
of A essentially give a ranking of the discriminative value of each variable. This form of variable 
selection is useful in exploratory data analysis. 

V. Implementation 

We now detail the calculation of the approximation of the Fisher information distance between 
two realizations of PDFs. Specifically, let Xf and Xg be realizations of PDFs f{x) and g{x) 
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respectively. Our goal is to calculate both an approximation of the Fisher information distance 
between the PDFs as well as the direction of the gradient with respect to a projection matrix A. 
Let us now illustrate the difficulties with these computations. 
Recall that the Hellinger distance (squared) is defined as 

DUf{x),gix)) = j (^./fi^ - dx. (18) 

Given the limited (and often unknown) support of x in both f(x) and g{x), it is appropriate to 
reformat this definition in terms of an expected value with respect to a single density f{x) or 

.i(/(.),.(.))4^(;-vf)/(^)-. 

These equations may be numerically approximated as follows: 

±. I 1 - 



Dl{f{x),g{x)) 



1 Y^ns I 1 




2 ) 



in which Uf and Ug are the number of samples x['^^ ^ ^i^d x^^' G Xg, and f{x) and 
g{x) are the kernel density estimates of PDFs f(x) and g(x) (see Section IV-DI and Appendix 
lAl) . The problem with these approximations is that they yield a non-symmetric estimate of the 
Hellinger distance D'jj{f{x),g{x)) ^ Djj{g{x), f{x)). Additionally, the estimate is unbounded 
from above. By definition the Hellinger distance should be symmetric and bounded by 2 (for 
the squared distance). 

When approximating the KuUback-Leibler divergence, a similar approach of formatting as an 
expectation may seem natural. The definition of the KL divergence 

KL{f{x)\\gix)) = [ f{x)\og^dx (20) 

in turn would be approximated as 



KLUWg) = -Y.^og 



tr g (xF^ 

Note that by definition the KL divergence is not necessarily symmetric, however it is strictly 
non-negative. This numerical approximation does not guarantee this non-negativity. 
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A. Metric Calculation 

We now detail our approximations which do not suffer from the aforementioned pitfalls. Define 

T(x) = ^^"^^ 

and 

f{x) = -^^^ . (21) 

f{x)+g{x) 

Note that < T{x) < 1. 

For simplicity, let us write / = f{x), g = g{x), and T — T{x). The Hellinger distance 
(squared) may be computed as follows: 



. 2 

(/ + 9)dx 



f n 



f + 9 Mf + g^ 

Vf-Vl-Tyfdx+ I (Vt - Vl-ry gdx. (22) 



Hence, we may now define our numerical approximation of the squared Hellinger distance as: 

^ 2 

DUf.Q) = —y'\xlT{xP]-,ll-f(xP 



" / / I 



which is both symmetric and bounded above by 2. 

The same formulation may be implemented when approximating the KuUback-Leibler diver- 
gence. Specifically, 

KL{f\\g) = f flog^dx 
J 9 

= Jt log dx + jT log ^g dx. (24) 
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Uf 

i$:f(.K))iog- 
^ ^ 1 



T ( xF^ 



rig (x^^M 



Hence 



which no longer suffers from the issue of potential negativity. This value is still non-symmetric, 
and when dealing with metric learning it is often desirable to have a symmetric dissimilarity 
metric. Hence, we implement the symmetric KL divergence ([8]) which maybe calculated in a 
similar manner: 

DKL{f.g) = KL{f\\g) + KL{g\\f) 

f log — dx + [ g log ^ dx 



9 J " °f 

if - 5-) log - dx 
9 



j (2T - 1) log dx + j (2T - 1) log ^^g dx, (26) 



yielding 



DKL{f,9) = -X^(2f (xF))-l)log ^f-^ 



T X 



+ - (2f (x?)) - 1) log ^f-^. (27) 

Although not previously discussed, we now illustrate another probabilistic distance measure, 
the Bhattacharya distance. This measure is a special case of the Chemoff distance, 

DcH = - log j f{xyg{xf-'dx, 

with s = \. These measure has been utilized for an upper bound on probability of error 
in classification problems regarding dimensionality reduction [15], and is therefore useful in 
comparative analysis between differing methods. The Bhattacharya distance may be numerically 
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formulated in the same manner as the KL divergence and Bellinger distance: 

= - log / \^V9 dx 



DBif,g) 



log 
log 



J y/T{l~T)fdx + j ^T{l-T)gdx 



X- 



if) 



l-T x) 



,(/) 



9 i=l 



(9) 



1 -T X 



(28) 



Note that the Bharracharya distance is directly related to the Hellinger distance by the transfor- 
mation Deif^g) = -log(l - ^Dnif^gf). 

In order to numerically approximate these information divergences (|23I25I27I28I) . we calculate 
T(x) as described in Section IV-DI 



B. Gradient Calculation 

In Section IV-AI we detailed expressions of the form 

which were used to numerically approximate the Hellinger distance and KuUback-Leibler diver- 
gence between PDFs f{x) and g{x). For simplicity, we write 

Uf ^ — ^ Ua ^ — ^ » 

Note that we do not continue with the Bhattacharya distance as we are mainly interested in that 
measure for final comparison to other dimension reduction methods, and we do not calculate the 
gradient w.r.t. this distance. If desired, the gradient may be calculated by a analytic transformation 
of the Hellinger distance gradient. 

The gradient of D w.r.t. some parameter 6, to which T yields some dependency, is defined as 

^-^V— — I ^V— — I 
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where 

§ = r(i-r)(|iog/-|iog«). (30) 

This derivation is explained in Appendix IB Substituting (l30l) into (|29l) . the gradient may be 
numerically approximated as 

-de = ;^I.^(^-^)9tU "^"^ "'J '^-'''^ 

( log/ log,)^.,. (31) 

Given this general setting, it is important to recognize that the only difference between the 
formulation of the Hellinger distance, KL divergence, and symmetric KL divergence is the 
definition of G(T). Hence, the formulation of the gradient is unchanged for all metrics, given a 
different definition of G{T). We now derive the value of T(l — T)|p for each metric. 

1) Hellinger Distance: From ^ we see that G{T) = (yT - Vl - ■ Therefore, 



dG 
df 



T 1-T 



1 -T 
2T- 1 



V(1-T)T' 

Hence, 

dG 

T(l - T)— = VT(1 - T) (2r - 1). (32) 

2) Kullback-Leibler Divergence: From (|24|) we see that G{T) = Tlog^. Therefore, 

dG . f T \ 1 

Of - '''[T^)^T^- ^''^ 

Hence, 

dG / T \ 

T(l - T)— = T(l - T) log (^^-^j + T. (34) 

For the symmetric KL divergence, (|26|) yields that G'(T) = (2T — 1) log Therefore, 

dG , f T \ 2T-1 
2 log 



dT ^\i-Tj r(i-r)' 

Hence, 

T(l-T)— = 2T(l-T)logf^-^j +2T-1. (35) 
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C. Matrix Gradient 



We now specify our abstraction to the specific task for IPCA, which is calculating the gradient 
of D w.r.t. the projection matrix A. First, let us derive ^ log f = -§x log in which f{Ax) 

may be estimated with kernel density estimation methods described in Appendix |Al with kernel 
locations x^- ' G Xf. 

d 



dA 



\ogf{Ax)\^u) 



d_ 
'dA 



log 



1 1 



2TiHf 



rU) 



-Hj^AC^f\x), 



Hy^A{x-xf){x-xy') 



\(x-x^P)TaThJ^A{x-x^P)) 



(36) 



where Hf is the kernel bandwidth for set X 



W, 



(/) 



exp {-\{x - xfYA^Hj^A{ 



EZi exp i-lix - x[^^yA^Hj^A{x - x 



and 



is the weighted sample covariance around x. In the same manner, 

^log(^(Ax))U,„ = -H-^AC^^\x), 



dA 



by evaluating the KDE with points x^^'^ G Xg. 

Finally, we may now define the gradient as follows 



nj 



OG 



^5:f(xP)(i-f(xP))9 



-{-H-'AC^^\x\f^)) 



X. 



-H7'AC^^\x[^^)) 



dT 



^ 1=1 



-{-H-^AC^^\x^^)) 



{-Hj^AC^f\x^f^)) 



(37) 
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D. Numerical Implementation 

1) PDFs and T{x): To estimate the PDFs /(x) and g{x), let us begin with the following 
definitions: 



(38) 



which in conjunction with the Gaussian KDE illustrated in Appendix El yield the density 
estimates: 

yf\2TiHf\ 

g{x)U.) = - — ^^=W^3''h, (39) 

where 1 is the vector of all ones and t^t is either / or g. Essentially, given the limited support 
of each Xf and Xg, we approximate the densities and their derivatives w.r.t. the samples in 
an appropriate set. Hence, is an element vector with elements equal to f{x^*^), 

x^*^ E X^. A similar interpretation holds for g{x)\^(*). 

Plugging the density estimates (|39l) into (|2TI) . we calculate our final estimates of T(x): 

f (x^^)) = ^ 1 ^ ^ + ^ ^ ) , (40) 

V^2^ Vv^2^ V^M^ 

where the notation ./ signifies element-by-element vector division. 

2) Gradient: We now describe the implementation of (l37l) . Specifically, let us numerically 
calculate 



dG 

yT[xr')[i-T[xr'), 

and extend towards the 3 other similar formulations such that 



= Zif, g) - Zif, f) + Zig, g) - Z{g, /). (41) 
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Let us continue (|38l) with the following additional definitions: 

dG 



(/,9) 



5'(/,9) 



(42) 



The formulation continues as follows: 

nt n 



Z{f,9) = lH;'AY,j:ni-f)p^,,,.W,f^\xi^^ -xf^^^^^ 

i=i j=i 

nf ng 



- X.iS^f'^Yx"; + X,diag((5(/'^))^l)Xj] . (43) 

Equation (l43l) and similar formulations (replacing / and g where appropriate) may be substi- 
tuted into (|4TI) to obtain the final calculation of the gradient -^D- 

VI. Conclusions 

The assumption that high-dimensional data lies on a Riemannian manifold in Euclidean space 
is often based on the ease of implementation due to the wealth of knowledge and methods 
based on Euclidean space. This assumption is not viable in many problems of practical interest, 
as there is often no straightforward and meaningful Euclidean representation of the data. In 
these situations it is more appropriate to assume the data is a realization of some PDF lying 
on a statistical manifold. Using information geometry, we have shown the ability to find a low- 
dimensional embedding of the manifold, which allows us to not only find the natural separation 
of the data, but to also reconstruct the original manifold and visualize it in a low-dimensional 
Euclidean space. This allows the use of many well known learning techniques which work based 
on the assumption of Euclidean data. 

We have also offered an unsupervised method of dimensionality reduction which preserves 
the information distances between high-dimensional data sets in the low-dimensional projection 
space. Information Preserving Component Analysis finds a low-dimensional projection space 
designed to maintain the similarities between data sets, which enables a comparative analysis in 
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how sets from different generative models occupy the projection space. Additionally, analysis 
of the loading vectors in the projection matrix allows for a means of variable selection, as the 
variables which are most crucial to preserving the information distances will have the largest 
loading values. 

We now stress that this framework is not meant to be the seminal result of information 
geometric dimensionality reduction, but a means for looking at common problems from a new 
angle. We have kept our formulations intentionally abstract as we make no claims as to which 
metric, implementation, or cost function is optimal. On the contrary, those choices are determined 

by the problem of interest. Instead, we have presented a framework to which others may tailor 
specifically to their needs, and we offer specific methods of implementation which may be 
immediately used. 



A. Appendix: Kernel Density Estimation 

We now illustrate the derivation of the kernel density estimate (KDE) of the PDF f{x) of 
the realization Xf. We utilize Gaussian kernels as the quadratic properties will be useful in 
implementation. Specifically, the KDE of a PDF is defined as 



where n/ is the number of sample points Xj E X j, K is some kernel satisfying the properties 

K{x) > 0, Vx e X, 

and h is the bandwidth or smoothing parameter. By utilizing the Gaussian kernel 



where S is the covariance of the kernel, we may combine the smoothing parameter vector h with 
S (ie. S = /) such that Hj = diag(/i). Note that we implement a vector bandwidth such that our 
Gaussian kernels are ellipses rather than spheres. There are a variety of methods for determining 



Appendix 




(44) 
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this bandwidth parameter; we choose to implement the maximal smoothing principle [|27l . This 

yields the a final kernel density estimate of 

1 \ 1 / 1 \ 

fix) = — > — , exp I — (x — xA^Hj^ix — Xj) I . (45) 

Let us now make the following definitions: 

of = {x - xff Hj\x - xf) 



mce 

D*^-^^ is the vector with elements d\^\ Substituting (|46|) into (l45l) . we obtain 



(f\ (f\ 
where Dj is a Mahalanobis distance between the point x and sample points Xj ^ Xf, and 



the KDE approximation of the PDF generating Xf. 

B. Appendix: Proof of strictly non-increasing property of KL divergence w.r.t. an orthonormal 
data projection 

Here we prove that the KL divergence between the PDFs of x and x' is greater or equal to 
the KL divergence between the PDFs of y = Ax and y' = Ax', respectively, where A satisfies 
AA^ = I. 

Theorem A. 1: Let RVs X,X' E R" have PDFs fx and fx', respectively. Using the dx n 
matrix A satisfying AA'^ = Id, construct RVs Y, Y' e R'^ such that Y = AX and Y' = AX'. 
The following relation holds: 

KL{fx\\fx')>KL{fy\\fy>), (48) 

where fy and fy/ are the PDFs of Y, Y', respectively. 

The proof is in two parts. First, we show that the KL divergence is constant over an arbitrary 
dimension preserving orthogonal transformation. Next, we show that the same truncation of two 
random vectors does not increase KL. 

Let M be an n X n orthonormal matrix, i.e., MM'^ = J„ and M'^M = In- Define the random 
vectors V, V E M" as follows V = MX and V = MX. By a change of variables, we have 

fv : Mv) = fxiM^'v) (49) 

fv : fv'{v') = fx'{M^v'). (50) 
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Note that the Jacobian of the transformation is 1 and is the inverse of the transformation 
both due to the orthonormality of M. The KL divergence between V and V is given by 

KL{fv\\fv') = [ fviv)\og^^dv. (51) 
J fv'{v) 

Substituting the PDFs from (|49l) into dSB, we have 

KLifvWfy,) = I fj,(M''v)\og^^^^dv. (52) 

Next, using the orthonormality of M we replace x = M'^v and dx = dv in (|52|) and obtain 

KLifvWfy,) = [ fx{x)\og^^dx = KLUxWfx') (53) 
J fx'{x) 

and the KL divergence remains the same. 

We proceed with the second part of the proof. Consider the random vector as a concatenation 

of RVs Y and Z: = [Y^Z^]. If we write matrix M as M'^ = [A^, B^] where A is d x n 

and B is {n-d)x n, then Y = AX, Y' = AX', and A satisfies AA^ = Id (but not A^A = /„). 

Since = [Y^Z^], we have KL{fv\\fv') = KLUyzWJy'z') and by virtue of ^ 

KL{fx\\fx') = KL{fYz\\fY'Z'). (54) 

Next, we use the following lemma: 

Lemma A.l: Let Y, Y' E R'^ and Z, Z' e R""'^ be RVs and denote: the joint PDF of Y and 
Z by fyz, the joint PDF of Y' and Z' by fy'z', the marginal PDF of Y by /y, the marginal 
PDF of Y' by fy, the conditional PDF of Z by fz\Y, and the conditional PDF of Z' by fz'\Y'. 
The following holds: 

j f{y)KL{fz\Y\\fz'\Y')dy = KLUyzVy'z') - KLifyWfy). (55) 
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This may be proven as follows: 




KLU{y,z)\\g{y,z))-KLU{y)\\g{y)). 



(56) 



Identifying that the LHS of (1551) is non-negative, we immediately obtain the following corollary: 

Corollary A.l: Let F, Y' e R'^ and Z, Z' e W^'^ be RVs and denote: the joint PDF of Y 
and Z by fyz, the joint PDF of Y' and Z' by fy'z', the marginal PDF of Y by fy, and the 
marginal PDF of Y' by fy/. The following holds: 



This corollary suggests that KL must not increase as a result of marginalization. Application of 
(|69l ) from Corollary IA.2I to (1541 ). yields the desired result 



C. Appendix: Proof of strictly non-increasing property of Hellinger distance w.r.t. an orthonor- 
mal data projection 

Here we prove that the Hellinger distance between the PDFs of x and x' is greater or equal 
to the Hellinger distance between the PDFs oi y = Ax and y' = Ax', respectively, where A 
satisfies AA^ = I. Note that much of this derivation is repetitive to that of the KL divergence 
in Appendix |Bj but we include all steps here for completeness. 
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Theorem A.2: Let RVs X,X' G M" have PDFs /x and fx', respectively. Using the d x n 
matrix A satisfying AA^ = h, construct RVs F, Y' e R'^ such that Y = AX and F' = AX'. 
The following relation holds: 

DH{fx,fx')>DH{fY,fY'), (59) 

where /y and fy' are the PDFs of Y, Y', respectively. 

The proof is in two parts. First, we show that the Hellinger distance is constant over an arbitrary 
dimension preserving orthogonal transformation. Next, we show that the same truncation of two 
random vectors does not increase the Hellinger distance. 

Let M be an n X n orthonormal matrix, i.e., MM^ = and M^M = /„. Define the random 
vectors V, V E M" as follows V = MX and V = MX. By a change of variables, we have 

fv : fv{v) = fx{M^v) (60) 
fv : fviv') = fx'iM^v'). (61) 

Note that the Jacobian of the transformation is 1 and is the inverse of the transformation 
both due to the orthonormality of M. The squared Hellinger distance between V and V' is given 
by 

D%ifvJv') = J {\^M^ - ^/f^y dv. (62) 
Substituting the PDFs from ( |60l ) into (|62l) . we have 

Dl{fvJv') = J [^Jfx{M^v)-sJfx'{M^v)f dv. (63) 
Next, using the orthonormality of M we replace x = M'^v and dx = dv in (|52|) and obtain 

DUfvJv') = J [^/hd^)-^/M^ydv = DlifxJx') (64) 

and the squared Hellinger distance remains the same. 

We proceed with the second part of the proof. Consider the random vector V as a concatenation 
of RVs Y and Z: = [Y^Z^]. If we write matrix M as M'^ = [A^, B^] where A is d x n 
and B is {n-d)x n, then Y = AX, Y' = AX', and A satisfies AA^ = Id (but not A'^A = /„). 
Since = [Y^Z^], we have Dj^ify, fv) = Dj^ifyz, fvz') and by virtue of dMl) 

DUfxJx') = DUfYZ,fY'Z'). (65) 
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Next, we use the following lemma: 

Lemma A.2: Let F, Y' G and Z, Z' G M""'^ be RVs and denote: the joint PDF of Y and 
Z by /y^, the joint PDF of Y' and by /y^s the marginal PDF of Y by /y, the marginal 
PDF of Y' by /y/, the conditional PDF of Z by fz\Y, and the conditional PDF of Z' by /^/ly/. 
The following holds: 

DIUyz. fv'Z') - DUfy, fv) > 0. (66) 
The proof this Lemma begins as follows: 

Dlifiy,z),giy,z))-Dlifiy),giy)) = 
Vfiy^z) - y/g{y,z)^ dydz- (^V fiv) - Vdi^j)) dy = 





f{y, z) + g{y, z) - 2^/ f{y,z)g{y,z)dydz - J f{y) + g{y) - 2^/ f{y)g{y)dy - 
We may now continue by showing J J ^J f{y, z)g{y, z) dy dz < J ^/ f{y)g{y) dy: 





V fiy^z)g{y,z)dydz =11 \/ f{y)f{z\y)g{y)g{z\y)dydz (67) 

V f{y)9{y)V f{z\y)g{z\y) dydz 




Vfiy)9iy) / Vfiz\y)giz\y)dz dy 



< I Vfiy)9iy) ( / VIW) dz) { I y^^) dzj dy 

(68) 

Vfiy)9{y) ( / f{z\y)dz )' { I g{z\y)dz ) ~ dy 



Vf{y)9{y){i)H^r'dy 



\/ f{y)9{y)dy. 



Note that (1671) used Bayes rule and (1681) used the Cauchy-Schwartz inequality. We now imme- 
diately obtain the following corollary: 
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Corollary A.2: Let Y, Y' G M"^ and Z, Z' e W^'^ be RVs and denote: the joint PDF of Y 
and Z by fyz, the joint PDF of Y' and Z' by /yzs the marginal PDF of Y by fy, and the 
marginal PDF of Y' by /y/. The following holds: 

DlifYzjY'Z')>DUfYjY'). (69) 

This corollary suggests that the squared Hellinger distance must not increase as a result of 
marginalization. Without loss of generality, due to the monotonic behavior of the square root 
function, the same may be said for the strict Hellinger distance, yielding the desired result 

DH{fxJx')>DH{fY,fY'). (70) 

D. Appendix: Orthonormality Constraint on Gradient Descent 

We derive the orthonormality constraint for our gradient descent optimization in the following 
manner; solving 

A = arer min J(A), 

A:AAT=I 

where I is the identity matrix. Using Lagrangian multiplier M, this is equivalent to solving 

A = argmin J(y4), 

where J (A) = J(A)+tT(A'^MA). We can iterate the projection matrix A, using gradient descent, 
as: 

d ~ 

A+i = Ai- fi—J{Ai), (71) 

where ^^(^4) = ^^(^4) + {M + M^)A is the gradient of the cost function w.r.t. matrix A. To 
ease notation, let A = ^J{Ai) and A = -^J{Ai). Continuing with the constraint Ai^iAf_^-^ = I, 



we right-multiply dVB by Aj_^_-^ and obtain 

= -/iA,A^ - fxAAj + fi^AA'^, 

^lAA^ = AA^ + AA^, (72) 
/i(A + (M + M^)A){A + (M + M^)Af = {AA{M + M^)A)A^ + A{AA^{M + M^)A). 
Let Q = M + M'^, hence A = A + QA. Substituting this into (|72l) we obtain: 
/x(AA^ + QAA^ + AA^Q + QQ^) = AA^ + AA'^ + 2Q. 
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Next we use the Taylor series expansion of Q around = 0: Q = Yl%.ol^^Qj- equating 
corresponding powers of /i (i.e. ^|/^=o = 0), we identify: 

gi = ^(A + go^)(A + go^r. 

Replacing the expansion of g in A = A + QA: 

A = A - ^(AA^ + A/\^)A + iiQ^A + ii^Q2A + .... 

Finally, we would like to assure a sufficiently small step size to control the error in forcing the 
constraint due to a finite Taylor series approximation of Q. Using the L2 norm of A allows us 
to calculate an upper bound on the Taylor series expansion: 

II All < ||A - i(AA^ + AA^)A\\ + \\QiA\\ + ^? \\Q2A\\ + . . . . 

We condition the norm of the first order term in the Taylor series approximation to be significantly 
smaller than the norm of the zeroth order term. If < || A - |(AA^ + AA^)y4||/||giy4|| then: 

l:nA, - - \ ( {§-^JiA)) Ar + A [l.J(Af) ) A (73, 

is a good approximation of the gradient constrained to AA^ = I. We omit the higher order 
terms as we experimentally find that they are unnecessary, especially as even /i^ 0. We note 
that while there are other methods for forcing the gradient to obey orthogonality [fTTTl . [fTOll . we 
find our method is straightforward and sufficient for our purposes. 
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E. Appendix: Gradient of T 

Calculation of the gradient of T — w.r.t. some parameter 6. Let F0 = for some 
arbitrary function F. 

= r^(iog/-iog(/ + ^)) 

U _ fe + ge 
f f + 9 
rp f feg - 99 f 



\fif + g) 

T ( (1 rj-\\^fd g go 



f f + gg 
= r((i-r)(iog/)e-(i-r)(iog^),) 

- ni-T)(|iog/-|iog, 
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